add example: <Generate Embeddings> and <Embedding Similarity Search> #274
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #274      +/-   ##
==========================================
+ Coverage   97.02%   97.05%   +0.02%
==========================================
  Files          17       18       +1
  Lines         705      712       +7
==========================================
+ Hits          684      691       +7
  Misses         15       15
  Partials        6        6
Hey, thank you for this PR!
@sashabaranov Hello, I believe that this example is what many developers need. It is a basic case of using Embedding for semantic search. I hope you can review and approve it. Thank you.
@aceld could you please make changes based on the comments above? Would love to merge this PR after the changes are made
@sashabaranov Which comment do you want me to edit?
@aceld both!
Duh, I think comments were in draft stage and not published, sorry!
@sashabaranov done.
embeddings_utils.go
Outdated
// Calculate dot product
dot := DotProduct(v1, v2)
// Calculate magnitude of v1
v1Magnitude := math.Sqrt(float64(DotProduct(v1, v1)))
Embeddings are normalized to length 1, so we don't need to do that. CosineSimilarity is equal to DotProduct in this case.
https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use
@sashabaranov OK, I just deleted the CosineSimilarity function. It was not used in the example anyway.
embeddings_test.go
Outdated
v2 := []float32{2, 4, 6}
expected := float32(28.0)
result := DotProduct(v1, v2)
if result != expected {
You can't compare floats like that https://bitbashing.io/comparing-floats.html
@sashabaranov So how should I compare them? Can you provide an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aceld something like
func isClose(a, b float32) bool {
	if a == b {
		return true
	}
	return math.Abs(float64(a-b)) < 1e-12
}
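The suggested helper can be wired into the test like this (a self-contained sketch; note that 1e-12 is well below float32's ~1e-7 machine epsilon, so a looser tolerance such as 1e-6 may be safer in practice):

```go
package main

import (
	"fmt"
	"math"
)

// dotProduct mirrors the DotProduct helper under review.
func dotProduct(v1, v2 []float32) float32 {
	var sum float32
	for i := range v1 {
		sum += v1[i] * v2[i]
	}
	return sum
}

// isClose compares floats within a tolerance instead of using ==,
// which is unreliable after floating-point arithmetic.
func isClose(a, b float32) bool {
	if a == b {
		return true
	}
	return math.Abs(float64(a-b)) < 1e-6
}

func main() {
	v1 := []float32{1, 2, 3}
	v2 := []float32{2, 4, 6}
	result := dotProduct(v1, v2) // 1*2 + 2*4 + 3*6 = 28
	if !isClose(result, 28.0) {
		fmt.Println("unexpected dot product")
		return
	}
	fmt.Println("ok")
}
```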
@aceld if you have some time — maybe let's remove all README changes, fix the float comparison, and merge!
@sashabaranov I apologize for not being able to find time to make the changes recently. I am sorry for any inconvenience caused, and I will try my best to allocate time next week to fix this issue.
@sashabaranov
@aceld hey, your imports are not sorted properly. Please refer to https://golangci-lint.run/ for more information
@sashabaranov done!
package openai

// DotProduct calculates the dot product of two vectors.
func DotProduct(v1, v2 []float32) float32 {
Why not put it as an Embedding method?
@sashabaranov Done, as you suggested.
Totally, but we don't really have []float32 vectors in the library except for the Embedding struct. Might make sense to add it as func (e Embedding) DotProduct(another Embedding).
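A method-based version as suggested could look like the sketch below. The struct here only mirrors the shape of the library's Embedding type (the vector lives in an Embedding field); it is a standalone illustration, not the library's actual definition.

```go
package main

import "fmt"

// Embedding is a stand-in for the library's Embedding struct;
// only the vector field is reproduced here for illustration.
type Embedding struct {
	Embedding []float32
}

// DotProduct computes the dot product of two embedding vectors,
// written as a method per the review suggestion.
func (e Embedding) DotProduct(other Embedding) float32 {
	var sum float32
	for i := range e.Embedding {
		sum += e.Embedding[i] * other.Embedding[i]
	}
	return sum
}

func main() {
	a := Embedding{Embedding: []float32{1, 2, 3}}
	b := Embedding{Embedding: []float32{2, 4, 6}}
	fmt.Println(a.DotProduct(b)) // 28
}
```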
file, err := os.Create("embeddings.bin")
if err != nil {
	fmt.Printf("Create file error: %v\n", err)
	return
}
defer file.Close()

encoder := gob.NewEncoder(file)
err = encoder.Encode(selectionsEmbeddings)
if err != nil {
	fmt.Printf("Encode error: %v\n", err)
	return
}
I think file I/O and marshalling is largely out of scope to the purpose of this example. Could you please remove it?
@sashabaranov Sure, you're right. Do you have any suggestions on how to store vector data more efficiently? I would appreciate some advice.
@aceld I think the storage of vector data is largely out of scope for this README — the point is just to show an example, not to build a vector-search DB.
input := "I am a Golang Software Engineer, I like Go and OpenAI."

// Get the embedding of the input
inputEmbd, err := getEmbedding(ctx, client, []string{input})
if err != nil {
	fmt.Printf("GetEmbedding error: %v\n", err)
	return
}

// Calculate similarity via cosine matching.
var questionScores []float32
for _, embed := range allEmbeddings {
	// OpenAI embeddings are normalized to length 1, which means that
	// cosine similarity can be computed slightly faster using just a dot product.
	score := openai.DotProduct(embed, inputEmbd)
	questionScores = append(questionScores, score)
}

// Take the indexes of the selections with the highest similarity
sortedIndexes := sortIndexes(questionScores)
sortedIndexes = sortedIndexes[:3] // Top 3
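The snippet relies on a sortIndexes helper that is not shown in this excerpt. A possible implementation (an assumption on my part, not the PR's actual code) returns the indexes of the scores ordered from highest to lowest, so slicing off the first three yields the top matches:

```go
package main

import (
	"fmt"
	"sort"
)

// sortIndexes returns the indexes of scores ordered by
// descending score, leaving the scores slice untouched.
func sortIndexes(scores []float32) []int {
	idx := make([]int, len(scores))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool {
		return scores[idx[a]] > scores[idx[b]]
	})
	return idx
}

func main() {
	scores := []float32{0.12, 0.87, 0.45, 0.91}
	top := sortIndexes(scores)[:3] // indexes of the three highest scores
	fmt.Println(top)               // [3 1 2]
}
```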
Could we please add this section to the previous example and have one single example for embeddings?
It's fantastic! How can I help this PR get merged? @aceld @sashabaranov
@Leeaandrob yes, please feel free to fork it and continue in another PR!
@sashabaranov @Leeaandrob The code conflicts have been resolved, and currently, all checks have passed. |
@aceld there are still a number of changes to be made mentioned above. To re-iterate: